Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter
Resource demands of HPC applications vary significantly. However, it is
common for HPC systems to primarily assign resources on a per-node basis to
prevent interference from co-located workloads. This gap between coarse-grained
resource allocation and varying resource demands can leave HPC resources
underutilized. In this study, we analyze the resource
usage and application behavior of NERSC's Perlmutter, a state-of-the-art
open-science HPC system with both CPU-only and GPU-accelerated nodes. Our
one-month usage analysis reveals that CPUs are commonly not fully utilized,
especially for GPU-enabled jobs. Also, around 64% of both CPU and GPU-enabled
jobs used 50% or less of the available host memory capacity. Additionally,
about 50% of GPU-enabled jobs used at most 25% of the GPU memory, and memory
capacity went underutilized to some degree across all jobs. While our study
comes early in Perlmutter's lifetime, so policies and application workloads may
change, it provides valuable insights into performance characterization and
application behavior, and it motivates systems with more fine-grained resource
allocation.
PaST-NoC: A Packet-Switched Superconducting Temporal NoC
Temporal computing promises to mitigate the stringent area constraints and
clock distribution overheads of traditional superconducting digital computing.
To design a scalable, area- and power-efficient superconducting network on chip
(NoC), we propose packet-switched superconducting temporal NoC (PaST-NoC).
PaST-NoC operates its control path in the temporal domain using race logic
(RL), combined with bufferless deflection flow control to minimize area.
Packets encode their destination using RL and carry a collection of data pulses
that the receiver can interpret as pulse trains, RL, serialized binary, or
other formats. We demonstrate how to scale up PaST-NoC to arbitrary topologies
based on 2x2 routers and 4x4 butterflies as building blocks. As we show, if
data pulses are interpreted using RL, PaST-NoC outperforms state-of-the-art
superconducting binary NoCs in throughput per area by as much as 5x for long
packets.

Comment: 14 pages, 18 figures, 2 tables. In press in IEEE Transactions on Applied Superconductivity
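The race-logic control path described in this abstract lends itself to a small illustration. In race logic, a value is encoded as the arrival time of a pulse, so MIN, MAX, and constant addition map onto first-arrival (an OR gate), last-arrival (an AND gate), and a fixed delay element. The sketch below models that encoding in software; the function names are illustrative, not from the paper.

```python
# Hedged sketch of race logic (RL), the temporal encoding PaST-NoC uses
# for its control path: a value is represented by the arrival time of a
# pulse, so basic operations become timing primitives.

def rl_min(*arrivals):
    # First-arriving pulse wins: an OR gate implements MIN in race logic.
    return min(arrivals)

def rl_max(*arrivals):
    # Last-arriving pulse: an AND gate implements MAX.
    return max(arrivals)

def rl_add_const(arrival, delay):
    # A fixed delay element adds a constant to a temporally coded value.
    return arrival + delay

# A router comparing two temporally encoded fields can pick the earlier
# pulse without ever converting to binary:
a, b = 3, 7            # values encoded as arrival times (cycle numbers)
winner = rl_min(a, b)  # -> 3
```

This is why RL-based control can be so area-efficient: the "computation" is a single gate observing pulse order, with no binary comparator in the path.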
Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics
The diversity of workload requirements and increasing hardware heterogeneity
in emerging high performance computing (HPC) systems motivate resource
disaggregation. Resource disaggregation allows compute and memory resources to
be allocated individually as required to each workload. However, it is unclear
how to efficiently realize this capability and cost-effectively meet the
stringent bandwidth and latency requirements of HPC applications. To that end,
we describe how modern photonics can be co-designed with modern HPC racks to
implement flexible intra-rack resource disaggregation and fully meet the bit
error rate (BER) and high escape bandwidth requirements of all chip types in modern HPC
racks. Our photonic-based disaggregated rack provides an average application
speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared
to a similar system that instead uses modern electronic switches for
disaggregation. Using observed resource usage from a production system, we
estimate that an iso-performance intra-rack disaggregated HPC system using
photonics would require 4x fewer memory modules and 2x fewer NICs than a
non-disaggregated baseline.

Comment: 15 pages, 12 figures, 4 tables. Published in IEEE Cluster 202
Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization
Continuing the scaling of quantum computers hinges on building classical
control hardware pipelines that are scalable, extensible, and provide real-time
response. The instruction set architecture (ISA) of the control processor
provides functional abstractions that map high-level semantics of quantum
programming languages to low-level pulse generation by hardware. In this paper,
we provide a methodology to quantitatively assess the effectiveness of the ISA
to encode quantum circuits for intermediate-scale quantum devices. The
characterization model that we define reflects performance, the ability to meet
timing constraints, scalability to future quantum chips, and other important
considerations, making it a useful guide for future designs. Using our
methodology, we propose scalar (QUASAR)
and vector (qV) quantum ISAs as extensions and compare them with other ISAs in
metrics such as circuit encoding efficiency, the ability to meet real-time gate
cycle requirements of quantum chips, and the ability to scale to more qubits.

Comment: 10 pages, 8 figures
Pre-Configured Routes
In multi-core ASICs, processors and other compute engines need to communicate with memory blocks and other cores with latency as close as possible to the ideal of a direct buffered wire. However, current state-of-the-art networks-on-chip (NoCs) suffer, at best, a latency of one clock cycle per hop. We investigate the design of a NoC that offers close-to-ideal latency on some preferred, run-time-configurable paths. Processors and other compute engines may perform network reconfiguration to guarantee low latency over different sets of paths as needed. Flits on non-preferred paths are given lower priority than flits on preferred paths to enable the latter to achieve low latency. To reach our goal, we extend the "mad-postman" technique [1]: every incoming flit is eagerly (i.e., speculatively) forwarded to the input's preferred output, if any. This incurs only the delay of a single pre-enabled tri-state driver. We later check whether that decision was correct, and if not, we forward the flit to the proper output. Incorrectly forwarded flits are classified as dead and are eliminated at later hops.
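The speculate-then-check flow of eager forwarding can be sketched as a toy model. The port names, routing table, and helper function below are hypothetical, invented only to illustrate the mechanism, not taken from the paper's hardware design.

```python
# Minimal sketch of the extended "mad-postman" eager forwarding: each
# incoming flit is speculatively sent to its input port's preferred
# output, then checked; a mis-forwarded copy becomes a dead flit (killed
# at a later hop) and the flit is re-sent on the correct output.

def route(flit_dest, in_port, preferred_out, routing_table):
    """Return (eager_out, corrected_out, dead_copy_emitted)."""
    eager_out = preferred_out[in_port]      # speculative, near-wire latency
    correct_out = routing_table[flit_dest]  # computed slightly later
    if eager_out == correct_out:
        return eager_out, None, False       # speculation succeeded
    # Speculation failed: the eager copy is now dead; resend properly.
    return eager_out, correct_out, True

preferred = {"west": "east"}                # a pre-configured low-latency path
table = {"mem0": "east", "core3": "north"}
route("mem0", "west", preferred, table)     # hit: no dead flit
route("core3", "west", preferred, table)    # miss: dead copy on "east"
```

The design choice mirrors the abstract: a correct speculation costs only the pre-enabled driver delay, while a wrong one costs a wasted (dead) flit plus a normal-latency retransmission.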
Variable-Width Datapath for On-Chip Network Static Power Reduction
With the tight power budgets in modern large-scale chips and the unpredictability of application traffic, on-chip network designers face a dilemma: design for worst-case bandwidth demands and incur high static power overheads, or design for an average traffic pattern and risk degrading performance. This paper proposes adaptive bandwidth networks (ABNs), which divide channels and switches into lanes so that the network provides just the bandwidth necessary at each hop. ABNs also activate input virtual channels (VCs) individually and take advantage of drowsy SRAM cells to eliminate false VC activations. In addition, ABNs readily apply to silicon defect tolerance with only the extra cost of fault detection. For application traffic, ABNs reduce total power consumption by an average of 45% with comparable performance compared to single-lane power-gated networks, and by 33% compared to multi-network designs.
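As a rough software analogy for the lane-division idea, a controller could enable only as many lanes as the current offered load requires and leave the rest power-gated. The parameters and the sizing rule below are invented for illustration and are not the paper's actual policy.

```python
# Hedged sketch of adaptive-bandwidth lane activation: a channel is split
# into lanes, and only enough lanes to carry the offered load are powered
# on; idle lanes stay gated to cut static power.

import math

def active_lanes(offered_load, lane_bw, total_lanes):
    """Number of lanes to enable for the current offered load.

    offered_load / lane_bw are in the same units (e.g. flits/cycle).
    """
    needed = math.ceil(offered_load / lane_bw)
    return max(1, min(needed, total_lanes))  # keep >= 1 lane for liveness

active_lanes(0.05, 0.25, 4)  # light traffic: a single lane suffices
active_lanes(2.00, 0.25, 4)  # heavy traffic: clamp at the channel width
```

The clamp at one lane reflects the need to keep the network connected even under near-zero load; the upper clamp is simply the physical channel width.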